Xarray in 45 minutes
Contents
Xarray in 45 minutes#
In this lesson, we discuss cover the basics of Xarray data structures. By the end of the lesson, we will be able to:
Understand the basic data structures in Xarray
Inspect
DataArrayandDatasetobjects.Read and write netCDF files using Xarray.
Understand that there are many packages that build on top of xarray
We’ll start by reviewing the various components of the Xarray data model, represented here visually:
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
%matplotlib inline
%config InlineBackend.figure_format='retina'
# load tutorial dataset
ds = xr.tutorial.load_dataset("air_temperature")
What’s in a Dataset?#
Many DataArrays!
# dataset repr
ds
<xarray.Dataset>
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...Datasets are dict-like containers of DataArrays i.e. they are a mapping of variable name to DataArray.
# pull out "air" dataarray with dictionary syntax
ds["air"]
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 ,
238.59999],
[243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999,
239.29999],
[250. , 249.79999, 248.89 , ..., 233.2 , 236.39 ,
241.7 ],
...,
[296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 ,
294.69998],
[295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 ,
295.19998],
[296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 ,
296.6 ]],
[[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999,
235.79999],
[243.59999, 244.09999, 244.2 , ..., 231. , 232.5 ,
235.7 ],
[253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 ,
238.5 ],
...
[293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 ,
294.29 ],
[296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 ,
294.38998],
[297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 ,
295.19 ]],
[[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999,
241.79 ],
[249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 ,
241.68999],
[262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 ,
246.29 ],
...,
[293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 ,
294.69 ],
[296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 ,
295.19 ],
[297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 ,
295.69 ]]], dtype=float32)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name: 4xDaily Air temperature at sigma level 995
units: degK
precision: 2
GRIB_id: 11
GRIB_name: TMP
var_desc: Air temperature
dataset: NMC Reanalysis
level_desc: Surface
statistic: Individual Obs
parent_stat: Other
actual_range: [185.16 322.1 ]You can save some typing by using the “attribute” or “dot” notation. This won’t
work for variable names that clash with a built-in method name (like mean for
example).
# pull out dataarray using dot notation
ds.air
What’s in a DataArray?#
data + (a lot of) metadata
Named dimensions#
(.dims)
ds.air.dims
('time', 'lat', 'lon')
Coordinate variables#
(.coords)
.coords is a simple data container
for coordinate variables.
ds.air.coords
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Coordinates objects support similar indexing notation
# extracting coordinate variables
ds.air.lon
<xarray.DataArray 'lon' (lon: 53)>
array([200. , 202.5, 205. , 207.5, 210. , 212.5, 215. , 217.5, 220. , 222.5,
225. , 227.5, 230. , 232.5, 235. , 237.5, 240. , 242.5, 245. , 247.5,
250. , 252.5, 255. , 257.5, 260. , 262.5, 265. , 267.5, 270. , 272.5,
275. , 277.5, 280. , 282.5, 285. , 287.5, 290. , 292.5, 295. , 297.5,
300. , 302.5, 305. , 307.5, 310. , 312.5, 315. , 317.5, 320. , 322.5,
325. , 327.5, 330. ], dtype=float32)
Coordinates:
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Attributes:
standard_name: longitude
long_name: Longitude
units: degrees_east
axis: X# extracting coordinate variables from .coords
ds.coords["lon"]
<xarray.DataArray 'lon' (lon: 53)>
array([200. , 202.5, 205. , 207.5, 210. , 212.5, 215. , 217.5, 220. , 222.5,
225. , 227.5, 230. , 232.5, 235. , 237.5, 240. , 242.5, 245. , 247.5,
250. , 252.5, 255. , 257.5, 260. , 262.5, 265. , 267.5, 270. , 272.5,
275. , 277.5, 280. , 282.5, 285. , 287.5, 290. , 292.5, 295. , 297.5,
300. , 302.5, 305. , 307.5, 310. , 312.5, 315. , 317.5, 320. , 322.5,
325. , 327.5, 330. ], dtype=float32)
Coordinates:
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Attributes:
standard_name: longitude
long_name: Longitude
units: degrees_east
axis: XIt is useful to think of the values in these coordinate variables as axis “labels” such as “tick labels” in a figure. These are coordinate locations on a grid at which you have data.
Arbitrary attributes#
(.attrs)
.attrs is a dictionary that can contain arbitrary python objects. Your only
limitation is that some attributes may not be writeable to a certain file formats
ds.air.attrs
{'long_name': '4xDaily Air temperature at sigma level 995',
'units': 'degK',
'precision': 2,
'GRIB_id': 11,
'GRIB_name': 'TMP',
'var_desc': 'Air temperature',
'dataset': 'NMC Reanalysis',
'level_desc': 'Surface',
'statistic': 'Individual Obs',
'parent_stat': 'Other',
'actual_range': array([185.16, 322.1 ], dtype=float32)}
# assign your own attribute
ds.air.attrs["who_is_awesome"] = "xarray"
ds.air.attrs
{'long_name': '4xDaily Air temperature at sigma level 995',
'units': 'degK',
'precision': 2,
'GRIB_id': 11,
'GRIB_name': 'TMP',
'var_desc': 'Air temperature',
'dataset': 'NMC Reanalysis',
'level_desc': 'Surface',
'statistic': 'Individual Obs',
'parent_stat': 'Other',
'actual_range': array([185.16, 322.1 ], dtype=float32),
'who_is_awesome': 'xarray'}
Underlying data#
(.data)
Xarray structures wrap underlying simpler data structures. In this case, the underlying data is a numpy array which you may be familiar with.
This part of xarray is quite extensible allowing for GPU arrays, sparse arrays, arrays with units etc. See the demo at the end.
ds.air.data
array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 ,
238.59999],
[243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999,
239.29999],
[250. , 249.79999, 248.89 , ..., 233.2 , 236.39 ,
241.7 ],
...,
[296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 ,
294.69998],
[295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 ,
295.19998],
[296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 ,
296.6 ]],
[[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999,
235.79999],
[243.59999, 244.09999, 244.2 , ..., 231. , 232.5 ,
235.7 ],
[253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 ,
238.5 ],
...,
[296.4 , 295.9 , 296.19998, ..., 295.4 , 295.1 ,
294.79 ],
[296.19998, 296.69998, 296.79 , ..., 295.6 , 295.5 ,
295.1 ],
[296.29 , 297.19998, 297.4 , ..., 296.4 , 296.4 ,
296.6 ]],
[[242.29999, 242.2 , 242.29999, ..., 234.29999, 236.09999,
238.7 ],
[244.59999, 244.39 , 244. , ..., 230.29999, 232. ,
235.7 ],
[256.19998, 255.5 , 254.2 , ..., 231.2 , 233.2 ,
238.2 ],
...,
[295.6 , 295.4 , 295.4 , ..., 296.29 , 295.29 ,
295. ],
[296.19998, 296.5 , 296.29 , ..., 296.4 , 296. ,
295.6 ],
[296.4 , 296.29 , 296.4 , ..., 297. , 297. ,
296.79 ]],
...,
[[243.48999, 242.98999, 242.09 , ..., 244.18999, 244.48999,
244.89 ],
[249.09 , 248.98999, 248.59 , ..., 240.59 , 241.29 ,
242.68999],
[262.69 , 262.19 , 261.69 , ..., 239.39 , 241.68999,
245.18999],
...,
[294.79 , 295.29 , 297.49 , ..., 295.49 , 295.38998,
294.69 ],
[296.79 , 297.88998, 298.29 , ..., 295.49 , 295.49 ,
294.79 ],
[298.19 , 299.19 , 298.79 , ..., 296.09 , 295.79 ,
295.79 ]],
[[245.79 , 244.79 , 243.48999, ..., 243.29 , 243.98999,
244.79 ],
[249.89 , 249.29 , 248.48999, ..., 241.29 , 242.48999,
244.29 ],
[262.38998, 261.79 , 261.29 , ..., 240.48999, 243.09 ,
246.89 ],
...,
[293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 ,
294.29 ],
[296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 ,
294.38998],
[297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 ,
295.19 ]],
[[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999,
241.79 ],
[249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 ,
241.68999],
[262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 ,
246.29 ],
...,
[293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 ,
294.69 ],
[296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 ,
295.19 ],
[297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 ,
295.69 ]]], dtype=float32)
# what is the type of the underlying data
type(ds.air.data)
numpy.ndarray
A numpy array!
Review#
Xarray provides two main data structures
DataArrays that wrap underlying data containers (e.g. numpy arrays) and contain associated metadata
Datasets that are dict-like containers of DataArrays
For more see
Why Xarray?#
Use metadata for fun and ~profit~ papers!
Analysis without xarray: X(#
# plot the first timestep
lat = ds.air.lat.data # numpy array
lon = ds.air.lon.data # numpy array
temp = ds.air.data # numpy array
plt.figure()
plt.pcolormesh(lon, lat, temp[0, :, :]);
temp.mean(axis=1) ## what did I just do? I can't tell by looking at this line.
array([[279.39798, 279.6664 , 279.66122, ..., 279.9508 , 280.31522,
280.6624 ],
[279.05722, 279.538 , 279.7296 , ..., 279.77563, 280.27002,
280.79764],
[279.0104 , 279.2808 , 279.5508 , ..., 279.682 , 280.19763,
280.81403],
...,
[279.63 , 279.934 , 280.534 , ..., 279.802 , 280.346 ,
280.77798],
[279.398 , 279.66602, 280.31796, ..., 279.766 , 280.34198,
280.834 ],
[279.27 , 279.354 , 279.88202, ..., 279.42596, 279.96997,
280.48196]], dtype=float32)
Analysis with xarray =)#
How readable is this code?
ds.air.isel(time=1).plot(x="lon");
Use dimension names instead of axis numbers
ds.air.mean("time")
<xarray.DataArray 'air' (lat: 25, lon: 53)>
array([[260.37564, 260.1826 , 259.88593, ..., 250.81511, 251.93733,
253.43741],
[262.7337 , 262.7936 , 262.7489 , ..., 249.75496, 251.5852 ,
254.35849],
[264.7681 , 264.3271 , 264.0614 , ..., 250.60707, 253.58247,
257.71475],
...,
[297.64932, 296.95294, 296.62912, ..., 296.81033, 296.28793,
295.81622],
[298.1287 , 297.93646, 297.47006, ..., 296.8591 , 296.77686,
296.44348],
[298.36594, 298.38593, 298.11386, ..., 297.33777, 297.28104,
297.30502]], dtype=float32)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0Extracting data or “indexing”#
(.sel, .isel)
Xarray supports
label-based indexing using
.selposition-based indexing using
.isel
For more see https://xarray.pydata.org/en/stable/indexing.html
Label-based indexing#
Xarray inherits its label-based indexing rules from pandas; this means great support for dates and times!
# pull out data for all of 2013-May
ds.sel(time="2013-05")
<xarray.Dataset>
Dimensions: (lat: 25, time: 124, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-05-01 ... 2013-05-31T18:00:00
Data variables:
air (time, lat, lon) float32 259.2 259.3 259.1 ... 298.2 297.6 297.5
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...# demonstrate slicing
ds.sel(time=slice("2013-05", "2013-07"))
<xarray.Dataset>
Dimensions: (lat: 25, time: 368, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-05-01 ... 2013-07-31T18:00:00
Data variables:
air (time, lat, lon) float32 259.2 259.3 259.1 ... 299.4 299.5 299.7
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...# demonstrate "nearest" indexing
ds.sel(lon=240.2, method="nearest")
<xarray.Dataset>
Dimensions: (lat: 25, time: 2920)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
lon float32 240.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat) float32 239.6 237.2 240.1 249.0 ... 294.8 296.9 298.4
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...# "nearest indexing at multiple points"
ds.sel(lon=[240.125, 234], lat=[40.3, 50.3], method="nearest")
<xarray.Dataset>
Dimensions: (lat: 2, time: 2920, lon: 2)
Coordinates:
* lat (lat) float32 40.0 50.0
* lon (lon) float32 240.0 235.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 268.1 283.0 265.5 ... 285.2 256.8 268.6
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...Position-based indexing#
This is similar to your usual numpy array[0, 2, 3] but with the power of named
dimensions!
# pull out time index 0 and lat index 0
ds.air.isel(time=0, lat=0) # much better than ds.air[0, 0, :]
<xarray.DataArray 'air' (lon: 53)>
array([241.2 , 242.5 , 243.5 , 244. , 244.09999, 243.89 ,
243.59999, 243.09999, 242.5 , 241.89 , 241.2 , 240.29999,
239.5 , 238.79999, 238.5 , 238.7 , 239.59999, 241. ,
242.89 , 244.79999, 246.5 , 247.79999, 248.59999, 249. ,
249.09999, 249.09999, 249. , 248.89 , 248.7 , 248.59999,
248.39 , 248.29999, 248.29999, 248.59999, 249. , 249.5 ,
249.59999, 249.09999, 247.79999, 245.39 , 242.2 , 238.5 ,
234.7 , 231.29999, 228.79999, 227.29999, 227. , 227.5 ,
228.79999, 230.59999, 232.79999, 235.5 , 238.59999],
dtype=float32)
Coordinates:
lat float32 75.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
time datetime64[ns] 2013-01-01
Attributes:
long_name: 4xDaily Air temperature at sigma level 995
units: degK
precision: 2
GRIB_id: 11
GRIB_name: TMP
var_desc: Air temperature
dataset: NMC Reanalysis
level_desc: Surface
statistic: Individual Obs
parent_stat: Other
actual_range: [185.16 322.1 ]
who_is_awesome: xarray# demonstrate slicing
ds.air.isel(lat=slice(10))
<xarray.DataArray 'air' (time: 2920, lat: 10, lon: 53)>
array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 ,
238.59999],
[243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999,
239.29999],
[250. , 249.79999, 248.89 , ..., 233.2 , 236.39 ,
241.7 ],
...,
[274.79 , 275.19998, 275.6 , ..., 277.19998, 277. ,
277. ],
[275.9 , 276.9 , 276.9 , ..., 280.9 , 280.5 ,
279.69998],
[276.69998, 277.4 , 277.69998, ..., 283.29 , 284.1 ,
283.9 ]],
[[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999,
235.79999],
[243.59999, 244.09999, 244.2 , ..., 231. , 232.5 ,
235.7 ],
[253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 ,
238.5 ],
...
[275.59 , 276.29 , 277.49 , ..., 275.19 , 275.79 ,
276.59 ],
[276.88998, 277.88998, 278.69 , ..., 273.59 , 274.29 ,
275.29 ],
[276.79 , 277.29 , 278.29 , ..., 274.19 , 275.38998,
276.88998]],
[[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999,
241.79 ],
[249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 ,
241.68999],
[262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 ,
246.29 ],
...,
[274.29 , 274.49 , 275.59 , ..., 274.69 , 274.99 ,
275.38998],
[276.79 , 277.49 , 277.99 , ..., 273.19 , 273.59 ,
274.19 ],
[276.88998, 277.29 , 277.59 , ..., 273.79 , 274.99 ,
276.19 ]]], dtype=float32)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name: 4xDaily Air temperature at sigma level 995
units: degK
precision: 2
GRIB_id: 11
GRIB_name: TMP
var_desc: Air temperature
dataset: NMC Reanalysis
level_desc: Surface
statistic: Individual Obs
parent_stat: Other
actual_range: [185.16 322.1 ]
who_is_awesome: xarrayConcepts for computation#
Broadcasting: expanding data#
Let’s try to calculate grid cell area associated with the air temperature data. We may want this to make a proper area-weighted domain-average for example
A very approximate formula is
assuming that \(Δlon\) = 111km and \(Δlat\) = 111km
dlon = np.cos(ds.air.lat * np.pi / 180) * 111e3
dlon
<xarray.DataArray 'lat' (lat: 25)>
array([ 28728.904, 33378.348, 37964.227, 42477.863, 46910.63 ,
51254.094, 55499.996, 59640.254, 63666.984, 67572.516,
71349.42 , 74990.51 , 78488.85 , 81837.78 , 85030.93 ,
88062.22 , 90925.875, 93616.445, 96128.82 , 98458.2 ,
100600.164, 102550.625, 104305.88 , 105862.58 , 107217.766],
dtype=float32)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0dlat = 111e3 * xr.ones_like(ds.air.lon)
dlat
<xarray.DataArray 'lon' (lon: 53)>
array([111000., 111000., 111000., 111000., 111000., 111000., 111000.,
111000., 111000., 111000., 111000., 111000., 111000., 111000.,
111000., 111000., 111000., 111000., 111000., 111000., 111000.,
111000., 111000., 111000., 111000., 111000., 111000., 111000.,
111000., 111000., 111000., 111000., 111000., 111000., 111000.,
111000., 111000., 111000., 111000., 111000., 111000., 111000.,
111000., 111000., 111000., 111000., 111000., 111000., 111000.,
111000., 111000., 111000., 111000.], dtype=float32)
Coordinates:
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0cell_area = dlon * dlat
cell_area
<xarray.DataArray (lat: 25, lon: 53)>
array([[3.1889083e+09, 3.1889083e+09, 3.1889083e+09, ..., 3.1889083e+09,
3.1889083e+09, 3.1889083e+09],
[3.7049966e+09, 3.7049966e+09, 3.7049966e+09, ..., 3.7049966e+09,
3.7049966e+09, 3.7049966e+09],
[4.2140291e+09, 4.2140291e+09, 4.2140291e+09, ..., 4.2140291e+09,
4.2140291e+09, 4.2140291e+09],
...,
[1.1577953e+10, 1.1577953e+10, 1.1577953e+10, ..., 1.1577953e+10,
1.1577953e+10, 1.1577953e+10],
[1.1750746e+10, 1.1750746e+10, 1.1750746e+10, ..., 1.1750746e+10,
1.1750746e+10, 1.1750746e+10],
[1.1901172e+10, 1.1901172e+10, 1.1901172e+10, ..., 1.1901172e+10,
1.1901172e+10, 1.1901172e+10]], dtype=float32)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0The result has two dimensions because xarray realizes that dimensions lon and
lat are different so it automatically “broadcasts” to get a 2D result. See the
last row in this image from Jake VanderPlas Python Data Science Handbook
Because xarray knows about dimension names we avoid having to create unnecessary
size-1 dimensions using np.newaxis or .reshape. For more, see
https://xarray.pydata.org/en/stable/computation.html#broadcasting-by-dimension-name
Alignment: putting data on the same grid#
When doing arithmetic operations xarray automatically “aligns” i.e. puts the
data on the same grid. In this case cell_area and ds.air are at the same
lat, lon points so things are multiplied as you would expect
(cell_area * ds.air.isel(time=1))
<xarray.DataArray (lat: 25, lon: 53)>
array([[7.72034658e+11, 7.73948047e+11, 7.75223575e+11, ...,
7.39826729e+11, 7.44928969e+11, 7.51944532e+11],
[9.02537150e+11, 9.04389657e+11, 9.04760132e+11, ...,
8.55854219e+11, 8.61411738e+11, 8.73267659e+11],
[1.06699214e+12, 1.06568581e+12, 1.06235671e+12, ...,
9.72597887e+11, 9.83512252e+11, 1.00504594e+12],
...,
[3.43170535e+12, 3.42591642e+12, 3.42938957e+12, ...,
3.42012723e+12, 3.41665409e+12, 3.41306507e+12],
[3.48057082e+12, 3.48644626e+12, 3.48750401e+12, ...,
3.47352072e+12, 3.47234553e+12, 3.46764529e+12],
[3.52619830e+12, 3.53702799e+12, 3.53940852e+12, ...,
3.52750718e+12, 3.52750718e+12, 3.52988771e+12]], dtype=float32)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
time datetime64[ns] 2013-01-01T06:00:00Now lets make cell_area unaligned i.e. change the coordinate labels
# make a copy of cell_area
# then add 1e-5 degrees to latitude
cell_area_bad = cell_area.copy(deep=True)
cell_area_bad["lat"] = cell_area.lat + 1e-5
cell_area_bad
<xarray.DataArray (lat: 25, lon: 53)>
array([[3.1889083e+09, 3.1889083e+09, 3.1889083e+09, ..., 3.1889083e+09,
3.1889083e+09, 3.1889083e+09],
[3.7049966e+09, 3.7049966e+09, 3.7049966e+09, ..., 3.7049966e+09,
3.7049966e+09, 3.7049966e+09],
[4.2140291e+09, 4.2140291e+09, 4.2140291e+09, ..., 4.2140291e+09,
4.2140291e+09, 4.2140291e+09],
...,
[1.1577953e+10, 1.1577953e+10, 1.1577953e+10, ..., 1.1577953e+10,
1.1577953e+10, 1.1577953e+10],
[1.1750746e+10, 1.1750746e+10, 1.1750746e+10, ..., 1.1750746e+10,
1.1750746e+10, 1.1750746e+10],
[1.1901172e+10, 1.1901172e+10, 1.1901172e+10, ..., 1.1901172e+10,
1.1901172e+10, 1.1901172e+10]], dtype=float32)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0cell_area_bad * ds.air.isel(time=1)
<xarray.DataArray (lat: 0, lon: 53)>
array([], shape=(0, 53), dtype=float32)
Coordinates:
* lat (lat) float64
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
time datetime64[ns] 2013-01-01T06:00:00Tip: If you notice extra NaNs or missing points after xarray computation, it means that your xarray coordinates were not aligned exactly.
For more, see https://xarray.pydata.org/en/stable/computation.html#automatic-alignment
High level computation#
(groupby, resample, rolling, coarsen, weighted)
Xarray has some very useful high level objects that let you do common computations:
groupby: Bin data in to groups and reduceresample: Groupby specialized for time axes. Either downsample or upsample your data.rolling: Operate on rolling windows of your data e.g. running meancoarsen: Downsample your dataweighted: Weight your data before reducing
groupby#
# seasonal groups
ds.groupby("time.season")
DatasetGroupBy, grouped over 'season'
4 groups with labels 'DJF', 'JJA', 'MAM', 'SON'.
# make a seasonal mean
seasonal_mean = ds.groupby("time.season").mean()
seasonal_mean
<xarray.Dataset>
Dimensions: (lat: 25, season: 4, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* season (season) object 'DJF' 'JJA' 'MAM' 'SON'
Data variables:
air (season, lat, lon) float32 247.0 247.0 246.7 ... 299.4 299.4 299.5The seasons are out of order (they are alphabetically sorted). This is a common
annoyance. The solution is to use .reindex
seasonal_mean = seasonal_mean.reindex(season=["DJF", "MAM", "JJA", "SON"])
seasonal_mean
<xarray.Dataset>
Dimensions: (season: 4, lat: 25, lon: 53)
Coordinates:
* season (season) <U3 'DJF' 'MAM' 'JJA' 'SON'
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Data variables:
air (season, lat, lon) float32 247.0 247.0 246.7 ... 299.4 299.4 299.5resample#
# resample to monthly frequency
ds.resample(time="M").mean()
<xarray.Dataset>
Dimensions: (time: 24, lat: 25, lon: 53)
Coordinates:
* time (time) datetime64[ns] 2013-01-31 2013-02-28 ... 2014-12-31
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Data variables:
air (time, lat, lon) float32 244.5 244.7 244.7 ... 297.7 297.7 297.7weighted#
# weight by cell_area and take mean over (time, lon)
ds.weighted(cell_area).mean(["lon", "time"]).air.plot()
[<matplotlib.lines.Line2D at 0x7f5eeec412d0>]
Visualization#
(.plot)
For more see https://xarray.pydata.org/en/stable/plotting.html and https://xarray.pydata.org/en/stable/examples/visualization_gallery.html
We have seen very simple plots earlier. Xarray has some support for visualizing 3D and 4D datasets by presenting multiple facets (or panels or subplots) showing variations across rows and/or columns.
# facet the seasonal_mean
seasonal_mean.air.plot(col="season")
<xarray.plot.facetgrid.FacetGrid at 0x7f5eeef36bc0>
# contours
seasonal_mean.air.plot.contour(col="season", levels=20, add_colorbar=True)
<xarray.plot.facetgrid.FacetGrid at 0x7f5f1ce9d0c0>
# line plots too? wut
seasonal_mean.air.mean("lon").plot.line(hue="season", y="lat")
[<matplotlib.lines.Line2D at 0x7f5eee7fcb80>,
<matplotlib.lines.Line2D at 0x7f5eee7fcbe0>,
<matplotlib.lines.Line2D at 0x7f5eee7fcd00>,
<matplotlib.lines.Line2D at 0x7f5eee7fce20>]
Reading and writing files#
Xarray supports many disk formats. Below is a small example using netCDF. For more see https://xarray.pydata.org/en/stable/io.html
# write to netCDF
ds.to_netcdf("my-example-dataset.nc")
/tmp/ipykernel_2068/1659397704.py:2: SerializationWarning: saving variable air with floating point data as an integer dtype without any _FillValue to use for NaNs
ds.to_netcdf("my-example-dataset.nc")
# read from disk
fromdisk = xr.open_dataset("my-example-dataset.nc")
fromdisk
<xarray.Dataset>
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 ...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...# check that the two are identical
ds.identical(fromdisk)
True
Tip: A common use case to read datasets that are a collection of many netCDF files. See https://xarray.pydata.org/en/stable/io.html#reading-multi-file-datasets for how to handle that
The scientific python ecosystem#
Xarray ties in to the larger scientific python ecosystem and in turn many packages build on top of xarray. A long list of such packages is here: https://xarray.pydata.org/en/stable/related-projects.html.
Now we will demonstrate some cool features.
Pandas: tabular data structures#
You can easily convert between xarray and pandas structures: https://pandas.pydata.org/
This allows you to conveniently use the extensive pandas ecosystem of packages (like seaborn) for your work.
See https://xarray.pydata.org/en/stable/pandas.html
# convert to pandas dataframe
df = ds.isel(time=slice(10)).to_dataframe()
df
| air | |||
|---|---|---|---|
| lat | time | lon | |
| 75.0 | 2013-01-01 00:00:00 | 200.0 | 241.199997 |
| 202.5 | 242.500000 | ||
| 205.0 | 243.500000 | ||
| 207.5 | 244.000000 | ||
| 210.0 | 244.099991 | ||
| ... | ... | ... | ... |
| 15.0 | 2013-01-03 06:00:00 | 320.0 | 297.000000 |
| 322.5 | 297.290009 | ||
| 325.0 | 296.899994 | ||
| 327.5 | 296.790009 | ||
| 330.0 | 297.100006 |
13250 rows × 1 columns
# convert dataframe to xarray
df.to_xarray()
<xarray.Dataset>
Dimensions: (lat: 25, time: 10, lon: 53)
Coordinates:
* lat (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* time (time) datetime64[ns] 2013-01-01 ... 2013-01-03T06:00:00
* lon (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Data variables:
air (lat, time, lon) float32 241.2 242.5 243.5 ... 296.9 296.8 297.1Numpy alternatives#
Xarray can wrap other array types! For example:
dask : parallel arrays https://xarray.pydata.org/en/stable/dask.html & https://docs.dask.org/en/latest/array.html
pydata/sparse : sparse arrays http://sparse.pydata.org
cupy : GPU arrays http://cupy.chainer.org
pint : unit-aware computations https://pint.readthedocs.org & https://github.com/xarray-contrib/pint-xarray
Dask#
Dask cuts up NumPy arrays into blocks and parallelizes your analysis code across these blocks
# demonstrate dask dataset
dasky = xr.tutorial.open_dataset(
"air_temperature",
chunks={"time": 10}, # 10 time steps in each block
)
dasky.air
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
dask.array<open_dataset-18f359f1029179d83daace1d7a89c4f6air, shape=(2920, 25, 53), dtype=float32, chunksize=(10, 25, 53), chunktype=numpy.ndarray>
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name: 4xDaily Air temperature at sigma level 995
units: degK
precision: 2
GRIB_id: 11
GRIB_name: TMP
var_desc: Air temperature
dataset: NMC Reanalysis
level_desc: Surface
statistic: Individual Obs
parent_stat: Other
actual_range: [185.16 322.1 ]All computations with dask-backed xarray objects are lazy, allowing you to build up a complicated chain of analysis steps quickly
# demonstrate lazy mean
dasky.air.mean("lat")
<xarray.DataArray 'air' (time: 2920, lon: 53)> dask.array<mean_agg-aggregate, shape=(2920, 53), dtype=float32, chunksize=(10, 53), chunktype=numpy.ndarray> Coordinates: * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
To get concrete values, call .compute or .load
# "compute" the mean
dasky.air.mean("lat").compute()
<xarray.DataArray 'air' (time: 2920, lon: 53)>
array([[279.39798, 279.6664 , 279.66122, ..., 279.9508 , 280.31522,
280.6624 ],
[279.05722, 279.538 , 279.7296 , ..., 279.77563, 280.27002,
280.79764],
[279.0104 , 279.2808 , 279.5508 , ..., 279.682 , 280.19763,
280.81403],
...,
[279.63 , 279.934 , 280.534 , ..., 279.802 , 280.346 ,
280.77798],
[279.398 , 279.66602, 280.31796, ..., 279.766 , 280.34198,
280.834 ],
[279.27 , 279.354 , 279.88202, ..., 279.42596, 279.96997,
280.48196]], dtype=float32)
Coordinates:
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00HoloViz#
Quickly generate interactive plots from your data!
The hvplot package attaches itself to all
xarray objects under the .hvplot namespace. So instead of using .plot use .hvplot
import hvplot.xarray
ds.air.hvplot(groupby="time", clim=(270, 300), widget_location='bottom')
Note
The time slider will only work if you’re executing the notebook, rather than viewing the website
cf_xarray#
cf_xarray is a project that tries to
let you make use of other CF attributes that xarray ignores. It attaches itself
to all xarray objects under the .cf namespace.
Where xarray allows you to specify dimension names for analysis, cf_xarray
lets you specify logical names like "latitude" or "longitude" instead as
long as the appropriate CF attributes are set.
For example, the "longitude" dimension in different files might be labelled as: (lon, LON, long, x…), but cf_xarray let’s you always refer to the logical name "longitude" in your code:
import cf_xarray
# describe cf attributes in dataset
ds.air.cf.describe()
Coordinates:
- CF Axes: * X: ['lon']
* Y: ['lat']
* T: ['time']
Z: n/a
- CF Coordinates: * longitude: ['lon']
* latitude: ['lat']
* time: ['time']
vertical: n/a
- Cell Measures: area, volume: n/a
- Standard Names: * latitude: ['lat']
* longitude: ['lon']
* time: ['time']
- Bounds: n/a
The following mean operation will work with any dataset that has appropriate
attributes set that allow detection of the “latitude” variable (e.g.
units: "degress_north" or standard_name: "latitude")
# demonstrate equivalent of .mean("lat")
ds.air.cf.mean("latitude")
<xarray.DataArray 'air' (time: 2920, lon: 53)>
array([[279.39798, 279.6664 , 279.66122, ..., 279.9508 , 280.31522,
280.6624 ],
[279.05722, 279.538 , 279.7296 , ..., 279.77563, 280.27002,
280.79764],
[279.0104 , 279.2808 , 279.5508 , ..., 279.682 , 280.19763,
280.81403],
...,
[279.63 , 279.934 , 280.534 , ..., 279.802 , 280.346 ,
280.77798],
[279.398 , 279.66602, 280.31796, ..., 279.766 , 280.34198,
280.834 ],
[279.27 , 279.354 , 279.88202, ..., 279.42596, 279.96997,
280.48196]], dtype=float32)
Coordinates:
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00# demonstrate indexing
ds.air.cf.sel(longitude=242.5, method="nearest")
<xarray.DataArray 'air' (time: 2920, lat: 25)>
array([[241. , 238. , 239.7 , ..., 292. , 293.9 ,
296.79 ],
[240. , 238.39 , 241.09999, ..., 292.6 , 294.1 ,
296.69998],
[240.7 , 238.89 , 240.79999, ..., 292.29 , 293.4 ,
296.1 ],
...,
[241.79 , 243.48999, 246.48999, ..., 294.69 , 296.69 ,
298.49 ],
[239.89 , 241.68999, 242.29 , ..., 295.09 , 296.88998,
298.59 ],
[239.59 , 241.48999, 240.79 , ..., 295.19 , 296.79 ,
298.88998]], dtype=float32)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
lon float32 242.5
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name: 4xDaily Air temperature at sigma level 995
units: degK
precision: 2
GRIB_id: 11
GRIB_name: TMP
var_desc: Air temperature
dataset: NMC Reanalysis
level_desc: Surface
statistic: Individual Obs
parent_stat: Other
actual_range: [185.16 322.1 ]
who_is_awesome: xarrayOther cool packages#
xgcm : grid-aware operations with xarray objects
xrft : fourier transforms with xarray
xclim : calculating climate indices with xarray objects
intake-xarray : forget about file paths
rioxarray : raster files and xarray
xesmf : regrid using ESMF
MetPy : tools for working with weather data
Check Xarray documentation for even more! https://xarray.pydata.org/en/stable/related-projects.html
More information#
A description of common terms used in the xarray documentation: https://xarray.pydata.org/en/stable/terminology.html
For information on how to create a DataArray from an existing numpy array: https://xarray.pydata.org/en/stable/data-structures.html#creating-a-dataarray
Answers to common questions on “how to do X” are here: https://xarray.pydata.org/en/stable/howdoi.html
Ryan Abernathey has a book on data analysis with a chapter on Xarray: https://earth-env-data-science.github.io/lectures/xarray/xarray_intro.html